An Awkward Disparity between BLEU / RIBES Scores and Human Judgements in Machine Translation
Authors
Abstract
Automatic evaluation of machine translation (MT) quality is essential to developing high-quality MT systems. Despite long-standing criticisms, BLEU remains the most popular machine translation metric. Previous studies of the schism between BLEU and manual evaluation highlighted cases where MT systems receive low BLEU scores yet high manual evaluation scores. The RIBES metric, which is more sensitive to reordering, has been shown to correlate better with human judgements, but in our experiments it also fails to do so. In this paper we demonstrate, via our submission to the Workshop on Asian Translation 2015 (WAT 2015), a patent translation system that attains very high BLEU and RIBES scores yet very poor human judgement scores.
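To make the metric under discussion concrete, here is a minimal sketch of sentence-level BLEU as defined by Papineni et al. (2002): clipped n-gram precisions combined by a geometric mean and scaled by a brevity penalty. The function names are illustrative, and actual evaluations such as WAT's rely on standard tooling rather than code like this.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis, reference, max_n=4):
    """Sentence-level BLEU: clipped n-gram precisions, geometric mean, brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        # Clip each hypothesis n-gram count by its count in the reference.
        matched = sum(min(count, ref_ngrams[g]) for g, count in hyp_ngrams.items())
        total = max(sum(hyp_ngrams.values()), 1)
        if matched == 0:
            return 0.0  # one zero precision zeroes the whole geometric mean
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: punish hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(log_precisions) / max_n)


print(bleu("the patent was granted", "the patent was granted"))  # 1.0
```

Because every match here is a strict surface identity, a fluent translation with heavy but legitimate reordering can still score well on n-gram overlap while reading poorly, which is the disparity the paper documents.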
Similar Papers
Dependency Analysis of Scrambled References for Better Evaluation of Japanese Translation
In English-to-Japanese translation, BLEU (Papineni et al., 2002), the de facto standard evaluation metric for machine translation (MT), correlates only very weakly with human judgements (Goto et al., 2011; Goto et al., 2013). Therefore, RIBES (Isozaki et al., 2010; Hirao et al., 2014) was proposed. RIBES measures the similarity between the word order of a machine-translated sentence and that of a correspon...
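The core of RIBES can be sketched compactly: a normalized Kendall's tau over the reference positions of aligned hypothesis words, damped by a unigram-precision factor. The sketch below is a simplification rather than the official tool: it aligns only words that occur exactly once in both sentences, whereas the real metric disambiguates repeated words using context, and the alpha value is an assumed default.

```python
from collections import Counter


def ribes_sketch(hypothesis, reference, alpha=0.25):
    """Normalized Kendall's tau over aligned word positions, damped by unigram precision."""
    hyp, ref = hypothesis.split(), reference.split()
    hyp_counts, ref_counts = Counter(hyp), Counter(ref)
    # Reference positions of the aligned hypothesis words, in hypothesis order.
    positions = [ref.index(w) for w in hyp
                 if hyp_counts[w] == 1 and ref_counts[w] == 1]
    n = len(positions)
    if n < 2:
        return 0.0
    # Fraction of concordant (in-order) position pairs, already in [0, 1];
    # with no ties this equals the usual (tau + 1) / 2 normalization.
    concordant = sum(1 for i in range(n) for j in range(i + 1, n)
                     if positions[i] < positions[j])
    nkt = concordant / (n * (n - 1) / 2)
    unigram_precision = n / len(hyp)  # over aligned words only, a simplification
    return nkt * unigram_precision ** alpha
```

A hypothesis with the right words in the wrong order keeps its unigram precision but loses most of its Kendall's tau, which is why RIBES is more sensitive to reordering than BLEU.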
A machine translation system combining rule-based machine translation and statistical post-editing
The system architecture, experimental settings, and evaluation results of EIWA in the WAT2014 Japanese-to-English (ja-en) and Chinese-to-Japanese (zh-ja) tasks are described. Our system combines rule-based machine translation (RBMT) and statistical post-editing (SPE). Evaluation results for the ja-en task show a 19.86 BLEU score, a 0.7067 RIBES score, and a 22.50 human evaluation score. Evaluation resu...
A Human Judgement Corpus and a Metric for Arabic MT Evaluation
We present a human judgements dataset and an adapted metric for the evaluation of Arabic machine translation. Our medium-scale dataset is the first of its kind for Arabic, with high annotation quality. We use the dataset to adapt the BLEU score to Arabic. Our score (AL-BLEU) provides partial credit for stem and morphological matches between hypothesis and reference words. We evaluate BLEU, METEOR an...
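The partial-credit idea can be sketched as a graded token match: full credit for an exact match, reduced credit for a stem-only match. The weights, the crude suffix-stripping stemmer, and the greedy matcher below are all illustrative stand-ins, not the paper's tuned components, which also credit morphological-feature matches.

```python
EXACT_WEIGHT, STEM_WEIGHT = 1.0, 0.5  # placeholder credit weights, not tuned values


def stem(word):
    """Stand-in stemmer; a real system would use an Arabic morphological analyzer."""
    return word.rstrip("ات")  # crude suffix stripping, for illustration only


def partial_match(hyp_token, ref_token):
    """Graded credit for a token pair: exact > stem-only > none."""
    if hyp_token == ref_token:
        return EXACT_WEIGHT
    if stem(hyp_token) == stem(ref_token):
        return STEM_WEIGHT
    return 0.0


def soft_unigram_precision(hypothesis, reference):
    """Greedy one-to-one matching of hypothesis tokens against reference tokens."""
    hyp, ref = hypothesis.split(), reference.split()
    credit = 0.0
    for h in hyp:
        best = max(ref, key=lambda r: partial_match(h, r), default=None)
        if best is not None and partial_match(h, best) > 0:
            credit += partial_match(h, best)
            ref.remove(best)  # each reference token may be consumed once
    return credit / max(len(hyp), 1)
```

Grading matches this way matters for a morphologically rich language like Arabic, where a hypothesis word often differs from the reference only in affixes.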
Dependency-based Automatic Enumeration of Semantically Equivalent Word Orders for Evaluating Japanese Translations
Scrambling is the acceptable reordering of verb arguments in languages such as Japanese and German. In automatic evaluation of translation quality, BLEU is the de facto standard method, but BLEU correlates only very weakly with human judgements for Japanese-to-English and English-to-Japanese translation. Therefore, alternative methods, IMPACT and RIBES, were proposed, and they have shown much ...
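The enumeration idea is easy to illustrate: given the case-marked argument chunks of a Japanese clause, every permutation that keeps the verb clause-final is a legitimate reference. The paper derives the chunks and their constraints from a dependency parse; the toy sketch below assumes pre-segmented chunks are simply given as input.

```python
from itertools import permutations


def scrambled_references(argument_chunks, verb):
    """Enumerate semantically equivalent orderings of a flat, verb-final clause."""
    return [" ".join(order) + " " + verb
            for order in permutations(argument_chunks)]


# Example: "taro ga" (subject), "hon o" (object), "katta" (bought).
for sentence in scrambled_references(["taro ga", "hon o"], "katta"):
    print(sentence)
# taro ga hon o katta
# hon o taro ga katta
```

Scoring a hypothesis against the best of these enumerated orders, rather than one fixed reference, avoids penalizing reorderings that a native speaker would accept.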
Modifications of Machine Translation Evaluation Metrics by Using Word Embeddings
Traditional machine translation evaluation metrics such as BLEU and WER are widely used, but they correlate poorly with human judgements because they represent word similarity badly and impose strict identity matching. In this paper, we propose modifications to these two metrics based on word embeddings. The evaluation results show that our...
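One natural reading of such a modification, sketched here under assumptions, is to soften WER's edit distance so that the substitution cost becomes the cosine distance between word vectors instead of a 0/1 identity test. The tiny embedding table is a stand-in for real pre-trained vectors, and the paper's exact formulation may differ.

```python
import math


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def soft_wer(hyp, ref, embeddings):
    """Levenshtein distance with embedding-based substitution costs, per reference word.

    Assumes every word has a vector in `embeddings` (no OOV handling).
    """
    h, r = hyp.split(), ref.split()
    # dp[i][j] = cheapest way to turn the first i hyp words into the first j ref words.
    dp = [[0.0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(1, len(h) + 1):
        dp[i][0] = float(i)
    for j in range(1, len(r) + 1):
        dp[0][j] = float(j)
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            sub = cosine_distance(embeddings[h[i - 1]], embeddings[r[j - 1]])
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # (soft) substitution
    return dp[len(h)][len(r)] / max(len(r), 1)


vectors = {"red": [0.0, 1.0], "car": [1.0, 0.0], "automobile": [0.9, 0.1]}
print(soft_wer("red automobile", "red car", vectors))  # near 0: "automobile" ~ "car"
```

A near-synonym substitution thus costs almost nothing, whereas under standard WER it would count as a full error.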